This clean dataset contains 1,599 red wines with 11 chemical properties of the wine and a quality rating between 0 (horrible) and 10 (exceptional).
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
An initial summary gives an overview of the data set.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of quality is roughly normal in shape, with a low of 3 and a high of 8. Most observations are either a 5 or a 6. The possible quality scores ranged from 0 to 10.
Residual sugar is positively skewed; plotting the histogram next to a box plot helps make sense of the distribution.
The box plot above shows a very large number of positive outliers. The number of outliers is calculated and printed below.
## Number of positive outliers: 155
## Number of negative outliers: 0
For a data set of 1,599 observations, that makes 9.7% of the data outliers. This is clearly not an error in measurement, so the outliers will remain in the data set.
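The report does not state how the outlier counts are obtained. A minimal sketch, assuming the usual box-plot convention (points more than 1.5 × IQR beyond the quartiles) and a hypothetical helper named `count_outliers`, applied to the first ten residual sugar values from the `str()` output:

```r
# Hypothetical helper: counts outliers with the box-plot (Tukey) rule.
# The 1.5 * IQR threshold is an assumption; the report does not state its rule.
count_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(positive = sum(x > q[2] + k * iqr, na.rm = TRUE),
    negative = sum(x < q[1] - k * iqr, na.rm = TRUE))
}

# First ten residual.sugar values shown in the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
n_out <- count_outliers(residual_sugar)
cat("Number of positive outliers:", n_out[["positive"]], "\n")
cat("Number of negative outliers:", n_out[["negative"]], "\n")

# Share of the full data set flagged in the report: 155 of 1,599 wines.
pct_outliers <- round(100 * 155 / 1599, 1)
pct_outliers
```

Counts from ten rows are only illustrative and are not expected to match the figures reported for the full data set.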
Plotting the residual sugar histogram on a logarithmic scale creates a more normal distribution of data.
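As a rough illustration of why the logarithmic scale helps, the sketch below compares the skewness of the raw and log10-transformed values for the first ten residual sugar observations from the `str()` output. The `skewness` helper is hypothetical, and base graphics are used here in place of whatever plotting code the report used.

```r
# First ten residual.sugar values shown in the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))
c(raw = skew_raw, log10 = skew_log)

# Histogram of the log10-transformed values (base graphics).
hist(log10(residual_sugar),
     main = "Residual sugar (log10 scale)",
     xlab = "log10(residual sugar)")
```

The transformed values remain right-skewed in this small sample, but less so than the raw values.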
The chloride distribution is positively skewed and its box plot is below.
The chloride distribution looks a lot like the residual sugar distribution, with the addition of some outliers in the negative direction as well. The number of positive and negative outliers is printed below.
## Number of positive outliers: 103
## Number of negative outliers: 9
These outliers seem like valid measurements and will be left in the data set.
Plotting the chloride data on a logarithmic scale makes the distribution more normal.
The sulphate distribution is also positively skewed, but it is more compact than that of sugars or chlorides. Once again a box plot can help visualise outliers.
## Number of positive outliers: 59
## Number of negative outliers: 0
The 59 sulphate outliers will remain in the data set.
Alcohol % data looks slightly positively skewed based on the histogram above.
## Number of positive outliers: 13
## Number of negative outliers: 0
There are only 13 outliers for the alcohol distribution.
The fixed acidity distribution looks slightly positively skewed.
## Number of positive outliers: 49
## Number of negative outliers: 0
The volatile acid distribution looks bimodal with a positive skew.
## Number of positive outliers: 19
## Number of negative outliers: 0
The spikes at 0.0 and 0.5 suggest measurement rounding. There is also a slight spike at 0.25; this should be taken into account when discussing the citric acid data.
## Number of positive outliers: 1
## Number of negative outliers: 0
The one positive outlier of 1.0, combined with the observed spikes at 0.0, 0.25 and 0.5, indicates an observation that should be removed from the data set. This value is removed for the remainder of this investigation.
## Number of positive outliers: 30
## Number of negative outliers: 0
These outliers are believed to be valid measurements and are left in the dataset.
Total sulfur dioxide is again positively skewed. There are a few observations near 300 that look very far removed from the rest of the distribution.
## Number of positive outliers: 55
## Number of negative outliers: 0
There are more outliers than just the two near 300 mg/dm^3. The two outliers near 300 are so large they are likely errors and will be removed from the data set.
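Dropping the extreme observations could look roughly like the sketch below. The toy vector combines the first ten total sulfur dioxide values from the `str()` output with the maximum (289) from the summary; the cutoff of 250 is a hypothetical value chosen only to separate the points near 300, since the report does not state the threshold it used.

```r
# Toy data: first ten total.sulfur.dioxide values from the str() output,
# plus the maximum (289) reported in the summary.
wine <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 289)
)

# Hypothetical cutoff; the report does not say which threshold was applied.
cutoff <- 250
wine_clean <- subset(wine, total.sulfur.dioxide < cutoff)

# Number of rows dropped by the filter.
n_removed <- nrow(wine) - nrow(wine_clean)
n_removed
```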
Density and pH are normally distributed.
Instead of looking at total sulfur dioxide and free sulfur dioxide separately, a new variable can be created: the ratio of free sulfur dioxide to total sulfur dioxide.
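A minimal sketch of this derived variable, assuming it is simply free divided by total sulfur dioxide, using the first ten rows of both columns from the `str()` output. The column name `sulfur.dioxide.ratio` matches the one that appears in the later summaries.

```r
# First ten rows of the two sulfur dioxide columns from the str() output.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Assumed definition: free SO2 as a fraction of total SO2.
wine$sulfur.dioxide.ratio <-
  wine$free.sulfur.dioxide / wine$total.sulfur.dioxide

summary(wine$sulfur.dioxide.ratio)
```

Because free sulfur dioxide is part of the total, the ratio should fall between 0 and 1.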
Sulfur dioxide ratio is normally distributed. The spike at 0.25 and possibly at 0.5 and 0.75 should be noted as possible measurement roundings.
## Number of positive outliers: 4
## Number of negative outliers: 0
Other variables that may be of interest are the total acid content and the ratios of certain types of acid (fixed, volatile, and citric) to the total acid content. The way these acid amounts affect the wine could lead to insights.
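These acid variables might be built as in the sketch below, assuming total acidity is the plain sum of the three acid columns. `total.acidity` matches the column name in the later summaries; the three ratio column names are hypothetical.

```r
# First ten rows of the three acid columns from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Assumed definition: total acidity as the sum of the three components.
wine$total.acidity <- with(wine, fixed.acidity + volatile.acidity + citric.acid)

# Hypothetical column names for each component's share of the total.
wine$fixed.acid.ratio    <- wine$fixed.acidity / wine$total.acidity
wine$volatile.acid.ratio <- wine$volatile.acidity / wine$total.acidity
wine$citric.acid.ratio   <- wine$citric.acid / wine$total.acidity

head(wine, 3)
```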
To learn which variables affect wine quality, an ordered factor with the levels low, medium, and high is added to the data. Low, medium, and high quality wines are defined as those with a score of 3 or 4, 5 or 6, and 7 or 8, respectively.
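One way to build such a factor is base R's `cut()`, sketched here on the first ten quality scores from the `str()` output. The break points are an assumption chosen to reproduce the 3 to 4, 5 to 6, and 7 to 8 bins; `quality.grade` matches the column name in the summaries.

```r
# First ten quality scores from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Intervals are right-closed: (2,4] = low, (4,6] = medium, (6,8] = high.
quality.grade <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

table(quality.grade)
```

Using an ordered factor keeps low < medium < high in that order on plot axes and legends.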
## X fixed.acidity volatile.acidity citric.acid
## Min. : 8.0 Min. : 4.900 Min. :0.1200 Min. :0.0000
## 1st Qu.: 470.5 1st Qu.: 7.400 1st Qu.:0.3100 1st Qu.:0.3000
## Median : 930.0 Median : 8.800 Median :0.3700 Median :0.4000
## Mean : 829.4 Mean : 8.856 Mean :0.4065 Mean :0.3737
## 3rd Qu.:1089.5 3rd Qu.:10.150 3rd Qu.:0.4900 3rd Qu.:0.4900
## Max. :1585.0 Max. :15.600 Max. :0.9150 Max. :0.7600
## residual.sugar chlorides free.sulfur.dioxide
## Min. :1.200 Min. :0.01200 Min. : 3.00
## 1st Qu.:2.000 1st Qu.:0.06200 1st Qu.: 6.00
## Median :2.300 Median :0.07300 Median :11.00
## Mean :2.657 Mean :0.07615 Mean :13.76
## 3rd Qu.:2.700 3rd Qu.:0.08550 3rd Qu.:17.00
## Max. :8.900 Max. :0.35800 Max. :54.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9906 Min. :2.880 Min. :0.3900
## 1st Qu.: 17.00 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6550
## Median : 26.00 Median :0.9958 Median :3.280 Median :0.7400
## Mean : 32.58 Mean :0.9961 Mean :3.291 Mean :0.7456
## 3rd Qu.: 43.00 3rd Qu.:0.9974 3rd Qu.:3.380 3rd Qu.:0.8250
## Max. :106.00 Max. :1.0032 Max. :3.780 Max. :1.3600
## alcohol quality sulfur.dioxide.ratio total.acidity
## Min. : 9.20 Min. :7.000 Min. :0.1364 Min. : 5.320
## 1st Qu.:10.80 1st Qu.:7.000 1st Qu.:0.3333 1st Qu.: 8.100
## Median :11.60 Median :7.000 Median :0.4474 Median : 9.470
## Mean :11.51 Mean :7.084 Mean :0.4365 Mean : 9.636
## 3rd Qu.:12.20 3rd Qu.:7.000 3rd Qu.:0.5353 3rd Qu.:10.990
## Max. :14.00 Max. :8.000 Max. :0.7556 Max. :17.045
## quality.grade
## low : 0
## medium: 0
## high :215
##
##
##
## X fixed.acidity volatile.acidity citric.acid
## Min. : 19.0 Min. : 4.600 Min. :0.2300 Min. :0.0000
## 1st Qu.: 474.5 1st Qu.: 6.800 1st Qu.:0.5800 1st Qu.:0.0200
## Median : 853.5 Median : 7.500 Median :0.6800 Median :0.0750
## Mean : 848.7 Mean : 7.850 Mean :0.7275 Mean :0.1603
## 3rd Qu.:1289.8 3rd Qu.: 8.375 3rd Qu.:0.8838 3rd Qu.:0.2675
## Max. :1522.0 Max. :12.500 Max. :1.5800 Max. :0.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 1.200 Min. :0.04500 Min. : 3.00
## 1st Qu.: 1.900 1st Qu.:0.06825 1st Qu.: 5.00
## Median : 2.100 Median :0.08000 Median : 9.00
## Mean : 2.673 Mean :0.08744 Mean :11.74
## 3rd Qu.: 2.750 3rd Qu.:0.09200 3rd Qu.:15.00
## Max. :12.900 Max. :0.26700 Max. :41.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9934 Min. :3.050 Min. :0.3300
## 1st Qu.: 13.25 1st Qu.:0.9957 1st Qu.:3.303 1st Qu.:0.4925
## Median : 25.00 Median :0.9966 Median :3.380 Median :0.5600
## Mean : 33.89 Mean :0.9966 Mean :3.395 Mean :0.5695
## 3rd Qu.: 47.75 3rd Qu.:0.9976 3rd Qu.:3.500 3rd Qu.:0.6000
## Max. :119.00 Max. :1.0010 Max. :3.900 Max. :1.1200
## alcohol quality sulfur.dioxide.ratio total.acidity
## Min. : 8.40 Min. :3.000 Min. :0.0531 Min. : 5.270
## 1st Qu.: 9.60 1st Qu.:4.000 1st Qu.:0.2760 1st Qu.: 7.572
## Median :10.00 Median :4.000 Median :0.3980 Median : 8.325
## Mean :10.23 Mean :3.839 Mean :0.3942 Mean : 8.738
## 3rd Qu.:11.00 3rd Qu.:4.000 3rd Qu.:0.5000 3rd Qu.: 9.330
## Max. :13.10 Max. :4.000 Max. :0.7083 Max. :13.450
## quality.grade
## low :62
## medium: 0
## high : 0
##
##
##
There are 1,599 red wines in this data set with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality).
All properties with the exception of quality are chemical properties of the wine. Quality is a score between 0 and 10 rated by at least 3 wine experts.
Observations:
The quality score ranges from 3 to 8 with most wines being scored a 5 or 6
The median quality is a 6, the mean is a 5.636
Residual sugar, chlorides, sulphates, fixed acidity, and sulfur dioxide are long tailed data
The main feature of interest in the data set is the quality score. Although quality is a subjective measure it is the ultimate thing of interest to wine manufacturers and consumers. Additionally, alcohol content is another variable of interest to some wine consumers. Some may value a high alcohol content wine even if it results in a lower quality score.
Investigating the chemical properties of the wine and their effect on quality is the goal of this analysis. Based on the univariate plots and summary data, it appears that the main properties affecting quality are acid content, alcohol content, and sulphates. These relationships are explored further in the bivariate and multivariate sections.
A new variable, sulfur dioxide ratio, was created from free sulfur dioxide and total sulfur dioxide. A variable of total acid content was created from the sum of fixed, volatile, and citric acid. An ordered factor was created from the quality variable, with low, medium, and high quality wines being scored a 3 or 4, 5 or 6, and 7 or 8, respectively.
Residual sugar, chlorides, sulphates, and sulfur dioxide are all positively skewed long tailed data.
Volatile acidity looks to be a bimodal distribution. The strangest distribution is with the citric acid content; there are large spikes at 0, .25 and .5 g/dm^3. This citric acid distribution is likely because of measurement rounding.
Plotting each variable against one another is a good way to start the bivariate analysis.
Looking at the correlation values, quality is positively correlated with alcohol, sulphates, and citric acid. Quality is negatively correlated with volatile acidity. Alcohol is positively correlated with pH and negatively correlated with total sulfur dioxide, chlorides and volatile acidity. These will be the relationships the bivariate analysis focuses on.
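A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses only the first ten rows from the `str()` output and assumes Pearson correlation (the `cor()` default), since the report does not name the method, so the values it prints are not expected to match those for the full data set.

```r
# First ten rows of selected columns from the str() output.
wine <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine)
round(cor_matrix, 2)

# Correlations with quality, sorted from most negative to most positive.
sort(cor_matrix["quality", colnames(cor_matrix) != "quality"])
```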
The box plots above give a clearer overview of how alcohol content, volatile acidity, citric acid, and sulphates relate to quality. Visualizing these data by overlaying the distributions with box plots makes these relationships clearer.
The other variable of interest, alcohol, is a continuous variable and better explored using scatter plots.
The relationships between alcohol and its correlated variables are not clear from the scatter plots.
The feature of interest in the data set (quality) varied with alcohol content, volatile acidity, citric acid, and sulphates. The strongest correlations are with alcohol content, volatile acidity and citric acid content.
When looking at the distributions of volatile acidity split by quality grade, outliers can be seen for both high and low quality wines. However, 5 of the 6 outliers for the high quality wines aren’t even larger than Q3 for the low quality wines.
Looking at the citric acid content of the high quality wines it is observed that although the median of the distribution lies higher than the low quality wines, there are a significant number of low values. There are even some outliers in the high quality distribution with a citric acid content of 0.
The strongest relationship found is between the alcohol content and the quality. As alcohol content increases so does the quality of the wine. The other strong relationships are between citric acid content and quality, and between volatile acid content and quality. As citric acid content increases so does the quality of the wine. As volatile acid content increases, the quality of the wine decreases.
The side by side plots of variables of interest vs. alcohol split by quality grade make the relationships clearer. As noted before, the horizontal lines on the citric acid plots are visible. About half of the low quality wines have a volatile acidity above 0.8 g/dm^3, whereas almost all the high quality wines have a volatile acidity lower than 0.8 g/dm^3.
Removing the medium quality wines (score 5 or 6) and coloring the low and high quality wines makes it easier to visualize the relationship between quality and the variables of interest.
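The filtering step might look like the sketch below, which rebuilds the quality grade for the first ten rows from the `str()` output and then drops the medium wines. The `cut()` break points and the name `wine_extremes` are assumptions; these ten rows happen to contain no low quality wines, so only the high grade survives here.

```r
# First ten rows of alcohol and quality from the str() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed binning: 3-4 = low, 5-6 = medium, 7-8 = high.
wine$quality.grade <- cut(wine$quality,
                          breaks = c(2, 4, 6, 8),
                          labels = c("low", "medium", "high"),
                          ordered_result = TRUE)

# Keep only low and high wines, then drop the factor levels left empty.
wine_extremes <- droplevels(subset(wine, quality.grade != "medium"))

table(wine_extremes$quality.grade)
```

Dropping unused levels keeps empty categories out of plot legends.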
By looking at separated values of low and high quality wines the relationship between these variables is made clearer. Low citric acid content correlates with low quality wine (with the exception of some outliers). High alcohol content and high citric acid content correlate with high quality wine.
It is surprising that high alcohol content correlates with high quality. It is interesting that citric acid plays such an important role in determining the quality of the wine.
This box plot was chosen because it shows the relationship between quality and citric acid content. As quality increases so does citric acidity.
An interesting feature of this plot is that as total acidity increases, volatile acidity decreases. Since volatile acidity is one component of total acidity, this result was unexpected. This plot was chosen because it shows that high quality wines have both low and high total acidity, whereas it is mostly the low quality wines that have high volatile acidity. This suggests that differences in wine quality across acid content are driven not by total acidity itself, but by lower volatile acidity and higher citric acid content.
This plot was chosen because it conveys one of the most interesting findings of this investigation. As alcohol content increases so does the quality of wine. High and medium quality wines are found both above and below 12% alcohol, but above 12% all the wine (with the exception of one) is either high or medium quality.
This exploratory data analysis produced some interesting findings about which variables affect wine quality. The clearest findings are that volatile acid content negatively affects quality, while citric acid and alcohol content positively affect it.
Total acidity doesn’t seem to have much of an effect on the quality outcome. This could be because an increase in total acidity reflects an increase in both citric acid and volatile acidity. Since these are oppositely correlated with quality, they could be cancelling each other out.
There are only 1,599 wines in this data set, with 1,319 of those falling in the medium quality category. This left the investigation with a relatively small number of wines in the low and high quality categories. Additionally, the quality scores ranged only from 3 to 8; it would be interesting to see how wines that score 9 or 10 would compare with the results found in this investigation.
It is also important to note that the quality of wine is a subjective measure. Who were the wine experts judging these wines? It is possible they have a taste for wine with a high citric acid content and high alcohol.